Dimension Reduction

A Quick Tour with Examples

USM Data Science

2026-02-04

What is Dimension Reduction?

Consider a typical machine learning situation

  • one response variable (or none - unsupervised)
  • one or more predictors (features, variables)

penguins data.

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Gentoo Biscoe 47.6 14.5 215 5400 male 2007
Adelie Dream 37.2 18.1 178 3900 male 2007
Adelie Dream 38.8 20.0 190 3950 male 2007
Gentoo Biscoe 53.4 15.8 219 5500 male 2009
Chinstrap Dream 50.9 19.1 196 3550 male 2008
Gentoo Biscoe 48.7 14.1 210 4450 female 2007

We are interested in the dimension of the predictor space.

Think linear algebra

Dimension of the column space

An Example

Sakar et al. (2019)

  • 252 patients: 188 with a previous diagnosis of Parkinson’s disease, 64 without.
  • Speech signal processing techniques yield 750 numerical predictors, or features.


Goal: classify Parkinson’s status.


Can we fit a logistic regression model?

The parkinsons design matrix

  • 252 rows and 750 + 1 = 751 columns (one for the intercept)

  • More columns than rows!

  • So the data matrix \(X\) cannot be full column rank - a problem for some ML models

  • \(p \gg n\) is common in many ML applications (e.g., genomics, image analysis, text mining)
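A quick numerical check of the rank claim, a sketch using a synthetic matrix of the same shape (numpy assumed; the data itself is random, not the parkinsons data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "wide" design matrix with the dimensions from the slide:
# n = 252 rows, 750 predictors plus an intercept column.
n, p = 252, 750
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

print(X.shape)                    # (252, 751): more columns than rows
print(np.linalg.matrix_rank(X))  # rank is at most min(252, 751) = 252
```

With fewer rows than columns, the rank can never reach the number of columns, so \(X^T X\) is singular and OLS-style fits are not unique.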

Feature selection - one form of dimension reduction

Even if \(X\) is full rank, we might want to eliminate some predictors to decrease model variance.

  • near-zero variance predictors

  • predictors that are highly correlated with each other

  • predictors uncorrelated with the response

Many methods…
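A minimal sketch of the first two filters, using sklearn's VarianceThreshold plus a simple pairwise-correlation cutoff on synthetic data (the variable names and the 0.95 cutoff are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors illustrating two problem cases:
x_useful = rng.normal(size=n)
x_near_zero_var = 3.0 + rng.normal(scale=1e-6, size=n)    # near-zero variance
x_duplicate = x_useful + rng.normal(scale=0.01, size=n)   # nearly a copy of x_useful
X = np.column_stack([x_useful, x_near_zero_var, x_duplicate])

# 1. Drop near-zero-variance columns.
vt = VarianceThreshold(threshold=1e-4)
X_vt = vt.fit_transform(X)
print(X_vt.shape)  # the near-constant column is removed

# 2. Drop one of each pair of highly correlated columns.
corr = np.corrcoef(X_vt, rowvar=False)
upper = np.triu(np.abs(corr), k=1)
keep = [j for j in range(X_vt.shape[1]) if not np.any(upper[:, j] > 0.95)]
print(keep)  # only one of the correlated pair survives
```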

Another approach: Feature engineering

More generally, can we transform and combine predictors in various ways to form a smaller set of new features?

Why do dimension reduction?

  • Reduce model variance - mitigate multicollinearity
  • Reduce model complexity
  • Improve model interpretability
  • Reduce computational cost
  • Mitigate the “curse of dimensionality”

Much of this is model dependent

Also used for

  • Visualize high-dimensional data
  • data compression
  • autoencoders - representation learning

Curse of dimensionality

Curse of dimensionality

Suppose we use KNN with \(p\) predictors, each \(\sim U(0,1)\)

  • volume of hypercube with sides of length \(d\) is \(d^p\).

  • To capture \(r\) proportion of the data, we want \(d^p = r\), so \(d = r^{1/p}\)

We are using the \(L_{\infty}\) norm (max norm) to define the neighborhood here, but the argument is similar for other norms.
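The formula \(d = r^{1/p}\) can be tabulated directly; a short sketch:

```python
# Side length d of a hypercube neighborhood needed to capture a fraction r
# of uniform points in [0, 1]^p: solve d^p = r, so d = r^(1/p).
r = 0.10
for p in (1, 2, 10, 100, 750):
    d = r ** (1 / p)
    print(f"p = {p:4d}: d = {d:.3f}")
```

Already at \(p = 10\), capturing 10% of the data requires a "neighborhood" spanning about 79% of each axis; by \(p = 750\) it is essentially the whole cube. Local methods like KNN stop being local.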

Feature selection/engineering/extraction and Dimension reduction

Feature “selection”

  • select a subset of existing predictors or features

Feature engineering

  • create new predictors or features from existing ones: transformations, combinations, interactions, polynomials, etc.

Feature extraction

  • create a smaller set of features from the existing ones that capture the “information” in the original predictors

Principal Components Analysis (PCA)

A Canonical Example

Principal Components Analysis (PCA)

Create features (components) from linear combinations of the predictors

  • The first component is in the direction of maximum variance of the data*.

  • The second component is in the direction of maximum variance orthogonal to (and thus uncorrelated with) the first, and so on.

*It is thus important to scale the predictors first if they are on different scales.

PCA is unsupervised!

PCA - the math

Predictors \(X_1, X_2, \dots, X_p\); form linear combinations

\[ Z_m = \sum_{j=1}^p \phi_{jm} X_j, \qquad m = 1, 2, \dots, p \]

In matrix form, collecting the weights \(\phi_{jm}\) into a matrix \(\mathbf{U}\),

\[ Z = X \mathbf{U} \]

Now with data

\[ \mathbf{Z} = \mathbf{XU} \]

Principal components and eigenvectors

First scale or standardize the data matrix.

The principal components - the columns of \(\mathbf{U}\) - are the eigenvectors of \(\mathbf{X}^T \mathbf{X}\).

(Equivalently, the eigenvectors of the correlation matrix - standardizing is automatic.)
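A sketch of this computation on synthetic data: standardize, eigendecompose the correlation matrix, form the scores \(\mathbf{Z} = \mathbf{XU}\), and confirm that each eigenvalue equals the variance of the corresponding score column (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # correlated toy data

# Standardize, then eigendecompose the correlation matrix X^T X / (n - 1).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n = Xs.shape[0]
corr = Xs.T @ Xs / (n - 1)

eigvals, U = np.linalg.eigh(corr)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]      # reorder by decreasing eigenvalue
eigvals, U = eigvals[order], U[:, order]

Z = Xs @ U                             # scores

# Each eigenvalue is the variance of the corresponding score column,
# and the eigenvectors are orthonormal.
print(np.allclose(Z.var(axis=0, ddof=1), eigvals))  # True
```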

Theorem

  • Eigenvectors of \(\mathbf{X}^T \mathbf{X}\) are orthogonal

  • Let \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0 \] with eigenvectors \(\mathbf{v}_1, \dots, \mathbf{v}_p\)

    • \(\lambda_i\) is variance in direction \(\mathbf{v}_i\)
    • \(\mathbf{v}_1\) maximizes variance
    • \(\mathbf{v}_2\) maximizes variance subject to orthogonality with \(\mathbf{v}_1\)
    • and so on

Principal components regression (PCR)

  1. Center and scale the data

  2. Form \(\mathbf{U}\), the matrix of eigenvectors of \(\mathbf{X}^T \mathbf{X}\) (with \(\mathbf{X}\) centered and scaled), ordered by decreasing eigenvalue

  3. Compute \(\mathbf{Z} = \mathbf{XU}\)

  4. Select the number of components \(m\) based on explained variance (or cross-validation).

  5. Fit a linear regression using the first \(m\) columns of \(\mathbf{Z}\) as predictors.

For \(m = p\), PCR equals ordinary least squares

In practice we will typically use software libraries to find the components
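The five steps can be sketched as a single sklearn pipeline on synthetic data (the data and the CV grid are illustrative choices; step 4 is handled here by GridSearchCV). The last lines check the \(m = p\) claim above:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Steps 1-3 and 5 as one pipeline: scale, rotate to components, regress.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("lm", LinearRegression()),
])

# Step 4: choose the number of components m by cross-validation.
search = GridSearchCV(pcr, {"pca__n_components": range(1, p + 1)}, cv=5)
search.fit(X, y)
print(search.best_params_)

# With m = p, PCR reproduces the ordinary least squares fit.
full = pcr.set_params(pca__n_components=p).fit(X, y)
ols = Pipeline([("scale", StandardScaler()), ("lm", LinearRegression())]).fit(X, y)
print(np.allclose(full.predict(X), ols.predict(X)))  # True
```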

Example: wine data (UCI ML Repository)

Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.


Class_label Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280_OD315_of_diluted_wines Proline
1 14.2 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 13.2 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
1 13.2 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
1 14.4 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
1 13.2 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
1 14.2 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450

Multinomial regression on the PCA features

We get a perfect score with just 5 components, and 98% accuracy with only 3.


num_pca_components accuracy
1 0.848315
2 0.960674
3 0.983146
4 0.983146
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
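A sketch of this sweep, using the copy of the UCI wine data bundled with sklearn. The exact accuracies depend on how the data is split (the hold-out split below is an illustrative choice), so the numbers may differ slightly from the table:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# sklearn ships the same UCI wine data: 13 predictors, 3 cultivars.
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

accs = {}
for m in range(1, 6):
    clf = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=m)),
        ("multinom", LogisticRegression(max_iter=1000)),  # multinomial fit
    ]).fit(X_tr, y_tr)
    accs[m] = clf.score(X_te, y_te)
    print(m, round(accs[m], 3))
```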

PCA as a preprocessing step


  • can be applied to a subset of predictors - not all or nothing.

  • works as a preprocessing/feature engineering step for any ML model

  • Use tidymodels recipes (step_pca()), sklearn.decomposition.PCA, or base R prcomp()
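A sketch of the "subset of predictors" idea with sklearn's ColumnTransformer, applying PCA to some columns and passing the rest through unchanged (synthetic data; the column indices are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))

# PCA on the first four columns only; the last two pass through untouched.
pre = ColumnTransformer([
    ("pca_block", Pipeline([("scale", StandardScaler()),
                            ("pca", PCA(n_components=2))]), [0, 1, 2, 3]),
], remainder="passthrough")

X_new = pre.fit_transform(X)
print(X_new.shape)  # (100, 4): 2 components + 2 passthrough columns
```

This preprocessor can then be the first step of any downstream model pipeline.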

Dimension reduction for visualization, understanding global structure, variable relationships

PCA on the USArrests dataset


Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7

Loadings and scores


PCA loadings: the columns of \(\mathbf{U}\)

PC1 PC2 PC3 PC4
Murder -0.536 -0.418 0.341 0.649
Assault -0.583 -0.188 0.268 -0.743
UrbanPop -0.278 0.873 0.378 0.134
Rape -0.543 0.167 -0.818 0.089


Cumulative normalized variances

[1] 0.620060 0.867502 0.956642 1.000000


PCA scores, \(\mathbf{Z} = \mathbf{XU}\)

PC1 PC2 PC3 PC4
Alabama -0.975660 -1.122001 0.439804 0.154697
Alaska -1.930538 -1.062427 -2.019500 -0.434175
Arizona -1.745443 0.738460 -0.054230 -0.826264
Arkansas 0.139999 -1.108542 -0.113422 -0.180974
California -2.498613 1.527427 -0.592541 -0.338559
Colorado -1.499341 0.977630 -1.084002 0.001450
Connecticut 1.344992 1.077984 0.636793 -0.117279
Delaware -0.047230 0.322089 0.711410 -0.873113
Florida -2.982760 -0.038834 0.571032 -0.095317
Georgia -1.622807 -1.266088 0.339018 1.065974

Visualizing the loadings

Biplots

We can visualize the loadings and the scores together with a biplot.

Non-linear dimension reduction

Manifold learning

Uniform manifold approximation and projection (UMAP)

penguins revisited


We use just the four numerical predictors

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Gentoo Biscoe 47.6 14.5 215 5400 male 2007
Adelie Dream 37.2 18.1 178 3900 male 2007
Adelie Dream 38.8 20.0 190 3950 male 2007
Gentoo Biscoe 53.4 15.8 219 5500 male 2009
Chinstrap Dream 50.9 19.1 196 3550 male 2008
Gentoo Biscoe 48.7 14.1 210 4450 female 2007


PCA vs UMAP on penguins


UMAP may be overkill here

UMAP

mnist dataset - images of handwritten digits (0-9). 784 features (28 x 28 pixels)


mnist samples

PCA vs UMAP on mnist

mnist dataset - images of handwritten digits (0-9). 784 features (28 x 28 pixels)


More on UMAP

Other manifold learning

Other manifold learning methods

  • t-SNE - t-distributed stochastic neighbor embedding (used mainly for visualization)
  • Isomap
  • Locally Linear Embedding (LLE)
  • neural networks/deep learning
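For example, a minimal t-SNE sketch using sklearn on synthetic blobs (a stand-in for real high-dimensional data such as mnist):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)

# Two well-separated clusters in 10 dimensions.
X = np.vstack([rng.normal(0, 1, size=(60, 10)),
               rng.normal(6, 1, size=(60, 10))])

# Embed into 2D for visualization; t-SNE preserves local neighborhoods,
# but global distances in the embedding are not meaningful.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (120, 2)
```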

Neural Networks

neural network diagram

Source: Gerhard Paab blog

autoencoder diagram. Source: S. Raschka, Deep Learning

Coding notes

  • prcomp() in base R; step_pca() in the recipes package (part of tidymodels)

  • sklearn.decomposition.PCA

  • umap package in R; also available as a recipes step

  • umap-learn - not part of sklearn, but part of the sklearn ecosystem (sklearn-compatible API)


See my slides’ source code, as well as the example R and ipynb files, for example code.

Recap

  • Why dimension reduction matters

    • High-dimensional predictors and rank deficiency
    • Variance, multicollinearity, interpretability, computational cost
    • The curse of dimensionality (analytical + simulation intuition)
  • Feature selection, engineering, extraction

  • Principal Components Analysis (PCA)

    • Mathematics; eigenvectors and eigenvalues
    • Principal components regression (PCR)
    • PCA loadings and scores
  • Nonlinear dimension reduction

    • UMAP, neural networks